FIGURE 2.4
Overview of Q-ViT, applying Information Rectification Module (IRM) for maximizing representation information and Distribution Guided Distillation (DGD) for accurate optimization.
inevitably deteriorates the attention module’s representation capability in capturing the in-
put’s global dependency. Second, the distillation for the fully quantized ViT baseline utilizes
a distillation token (following [224]) to directly supervise the quantized ViT classification
output. However, we found that such simple supervision is not effective enough: it is coarse-grained and cannot bridge the large gap between the quantized attention scores and their full-precision counterparts.
To address the issues above, a fully quantized ViT (Q-ViT) [136] is developed that retains the distribution of the quantized attention modules so that it matches that of the full-precision counterparts (see the overview in Fig. 2.4). Accordingly, we propose to rectify the distorted distribution of the quantized attention modules through an Information Rectification Module (IRM) based on information entropy maximization in the forward process. In the backward process, we present a distribution-guided distillation (DGD) scheme to eliminate the distribution variation through an attention similarity loss between the quantized ViT and its full-precision counterpart.
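To make the attention-level distillation idea more concrete, the following is a minimal PyTorch sketch of a distillation loss that aligns the quantized student's attention maps with those of a full-precision teacher, layer by layer. The function name, the per-layer MSE similarity measure, and the assumed tensor shapes are illustrative assumptions, not the exact DGD formulation used in Q-ViT.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn, teacher_attn):
    """Attention-similarity distillation sketch: align the quantized student's
    attention distributions with the full-precision teacher's, layer by layer.

    student_attn, teacher_attn: lists of tensors of shape
    (batch, heads, tokens, tokens), e.g. post-softmax attention scores
    collected from each MHSA block (assumed interface).
    """
    loss = 0.0
    for s, t in zip(student_attn, teacher_attn):
        # MSE is used here as a simple stand-in for the DGD similarity term;
        # the teacher is detached so gradients only flow into the student.
        loss = loss + F.mse_loss(s, t.detach())
    return loss / len(student_attn)
```

Such a loss is added to the ordinary task loss during backpropagation, so the quantized attention modules are pulled toward the teacher's attention distribution rather than being supervised only at the classifier output.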
2.3.1
Baseline of Fully Quantized ViT
First, we build a baseline to study fully quantized ViT, since no such baseline has been proposed in previous work. A straightforward solution is to quantize the representations (weights and activations) of the ViT architecture in the forward propagation and to apply distillation to the optimization in the backward propagation.
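A rough sketch of one training step of such a baseline is given below: a forward pass through the quantized model plus a soft-label distillation term from a full-precision teacher on the classification output. The module names (`quantized_vit`, `teacher_vit`), the temperature, and the equal loss weighting are hypothetical placeholders rather than the actual baseline implementation.

```python
import torch
import torch.nn.functional as F

def baseline_training_step(quantized_vit, teacher_vit, images, labels, optimizer, tau=1.0):
    """One step of the fully quantized ViT baseline: quantized forward pass,
    cross-entropy on labels, and distillation from a full-precision teacher."""
    logits_q = quantized_vit(images)          # forward pass through quantized weights/activations
    with torch.no_grad():
        logits_t = teacher_vit(images)        # full-precision teacher predictions

    ce_loss = F.cross_entropy(logits_q, labels)
    # Soft-label distillation on the classification output (coarse-grained supervision).
    kd_loss = F.kl_div(
        F.log_softmax(logits_q / tau, dim=-1),
        F.softmax(logits_t / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2

    loss = ce_loss + kd_loss
    optimizer.zero_grad()
    loss.backward()        # gradients pass the quantizers via a straight-through estimator (assumed)
    optimizer.step()
    return loss.item()
```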
Quantized ViT architecture.
We briefly review the technology of neural network quantization. We first introduce a general asymmetric activation quantization and symmetric weight quantization scheme as
$$
\begin{aligned}
Q_a(x) &= \left\lfloor \mathrm{clip}\{(x - z)/\alpha_x,\ -Q_n^x,\ Q_p^x\} \right\rceil, &\quad \hat{x} &= Q_a(x) \times \alpha_x + z,\\
Q_w(w) &= \left\lfloor \mathrm{clip}\{w/\alpha_w,\ -Q_n^w,\ Q_p^w\} \right\rceil, &\quad \hat{w} &= Q_w(w) \times \alpha_w.
\end{aligned}
\tag{2.13}
$$
Here, $\mathrm{clip}\{y, r_1, r_2\}$ returns $y$ with values below $r_1$ set as $r_1$ and values above $r_2$ set as $r_2$, and $\lfloor y \rceil$ rounds $y$ to the nearest integer. With quantization of activations to signed $a$ bits and weights to signed $b$ bits, $Q_n^x = 2^{a-1}$, $Q_p^x = 2^{a-1} - 1$, $Q_n^w = 2^{b-1}$, and $Q_p^w = 2^{b-1} - 1$.
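To make Eq. (2.13) concrete, the following is a minimal PyTorch sketch of the asymmetric activation quantizer and symmetric weight quantizer. The function names, the per-tensor scales, and the default bit-widths are illustrative assumptions, not the exact Q-ViT code.

```python
import torch

def quantize_activation(x, alpha_x, z, a_bits=4):
    """Asymmetric activation quantizer of Eq. (2.13):
    Q_a(x) = round(clip((x - z) / alpha_x, -Q_n^x, Q_p^x)), then de-quantize."""
    qn, qp = 2 ** (a_bits - 1), 2 ** (a_bits - 1) - 1
    q = torch.clamp((x - z) / alpha_x, -qn, qp).round()
    return q * alpha_x + z          # x_hat = Q_a(x) * alpha_x + z

def quantize_weight(w, alpha_w, b_bits=4):
    """Symmetric weight quantizer of Eq. (2.13):
    Q_w(w) = round(clip(w / alpha_w, -Q_n^w, Q_p^w)), then de-quantize."""
    qn, qp = 2 ** (b_bits - 1), 2 ** (b_bits - 1) - 1
    q = torch.clamp(w / alpha_w, -qn, qp).round()
    return q * alpha_w              # w_hat = Q_w(w) * alpha_w
```

In practice, the non-differentiable rounding is usually paired with a straight-through estimator during training, and the scales $\alpha_x$, $\alpha_w$ and the zero point $z$ are calibrated from data or learned; Eq. (2.13) itself only specifies the forward mapping.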
In general, the forward and backward propagation of the quantization function in the quantized